Add Model2Vec as an embedding backend #2245

Open · wants to merge 1 commit into master

MaartenGr (Owner)

What does this PR do?

Add Model2Vec as an incredibly fast but still quite accurate embedding backend.

Usage is straightforward. First, install model2vec:

pip install model2vec

Then, you can load any of their models and pass it to BERTopic like so:

from bertopic import BERTopic
from model2vec import StaticModel

# Load a pre-trained static embedding model and pass it to BERTopic
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")

topic_model = BERTopic(embedding_model=embedding_model)
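
For completeness, a minimal end-to-end run might look as follows; the 20 Newsgroups dataset is only a placeholder for your own documents:

from bertopic import BERTopic
from model2vec import StaticModel
from sklearn.datasets import fetch_20newsgroups

# Any list of strings works here; 20 Newsgroups is used purely as an example corpus
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Embed the documents with the static Model2Vec model and fit the topic model
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)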

Distillation

These models are extremely versatile and can be distilled from an existing embedding model (such as those compatible with sentence-transformers). The distillation process doesn't require a vocabulary (it uses the tokenizer's vocabulary by default) but can benefit from having one. Fortunately, this allows you to use the vocabulary from your input documents to distill a model yourself.

Doing so requires installing some additional model2vec dependencies:

pip install model2vec[distill]

To then distill common embedding models, you need to import the Model2VecBackend from BERTopic:

from bertopic import BERTopic
from bertopic.backend import Model2VecBackend

# Choose a model to distill (a non-Model2Vec model)
embedding_model = Model2VecBackend(
    "sentence-transformers/all-MiniLM-L6-v2",
    distill=True
)

topic_model = BERTopic(embedding_model=embedding_model)

You can also choose a custom vectorizer for creating the vocabulary and pass custom arguments to the distillation process:

from bertopic import BERTopic
from bertopic.backend import Model2VecBackend
from sklearn.feature_extraction.text import CountVectorizer

# Choose a model to distill (a non-Model2Vec model)
embedding_model = Model2VecBackend(
    "sentence-transformers/all-MiniLM-L6-v2",
    distill=True,
    distill_kwargs={"pca_dims": 256, "apply_zipf": True, "use_subword": True},
    distill_vectorizer=CountVectorizer(ngram_range=(1, 3))
)

topic_model = BERTopic(embedding_model=embedding_model)
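
Since the vocabulary can come from your input documents (see above), a rough end-to-end sketch might look as follows; the 20 Newsgroups subset is only a stand-in for your own corpus, and the distillation is assumed to happen when the documents are embedded during fitting:

from sklearn.datasets import fetch_20newsgroups

# Any list of strings works here; 20 Newsgroups is just an example corpus
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))["data"]

# Fitting embeds the documents with the (freshly distilled) static model
topics, probs = topic_model.fit_transform(docs)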

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?
